<HTML>
<HEAD>
<TITLE>Supporting material for OOPSLA-2012 publication</TITLE>
<script type="text/javascript">

  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-1777267-5']);
  _gaq.push(['_trackPageview']);

  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

</script>
</HEAD>
<BODY>
<H1>Open and Efficient Type Switch for C++</H1>
<P>by 
<A href="http://parasol.tamu.edu/~yuriys/">Yuriy Solodkyy</A>, 
<A href="http://parasol.tamu.edu/~gdr/">Gabriel Dos Reis</A> and 
<A href="http://parasol.tamu.edu/~bs/">Bjarne Stroustrup</A>
</P>
<H3><A href="http://doi.acm.org/10.1145/2384616.2384686">OOPSLA-2012 paper</A> supporting material</H3>

<H2>The Library</H2>

<H3><a name="typeswitch">Type Switch</a></H3>

<P>The library source code can be downloaded here: 
<A href="typeswitch-2013-02-18.zip">typeswitch-2013-02-18.zip</A>.<BR> <B>NOTE:</B> This version does not provide 
support of multi-threading. Please wait until next release, we are working on a 
thread-safe version. <A href="mailto:solodon+typeswitch@gmail.com" title="TypeSwitch Inquiry">Send me an e-mail</A> if you need it sooner.
</P>
<P>
The library is a subset of a larger <A href="#" title="Link to be supplied...">pattern-matching 
library</A> and only exposes the facilities discussed in the paper. It uses a 
smaller subset of C++11 than the general library and thus can be compiled on 
more compilers. It is also supposedly easier to understand since there is less 
code in it. Because it is a subset of a larger library, not every configuration 
option mentioned in <TT>config.hpp</TT> will be applicable to this subset.
</P>
<P>
To build various test cases with GCC (4.4 or later), simply invoke 
<TT>make</TT>. To do the same with Visual C++ (2010 or later), invoke 
<TT>build.bat</TT> from the command line. For the list of other supported 
build targets see the source of <TT>Makefile</TT> or <TT>build.bat</TT> 
respectively. Some of the source files are known to crash GCC 4.4.5 on my Fedora 
13 box, just delete them if they also crash your compiler -- the specific 
problems causing the crash were reported to GCC or solved in later versions 
already. 
</P>
<P>
If your class hierarchy has virtual functions in its root class, you can start 
using the library by simply including <TT>match.hpp</TT>. The only statements 
that would be available are: <TT>Match</TT>, <TT>EndMatch</TT>, <TT>Case</TT> 
and <TT>Otherwise</TT>. Other statements mentioned in the header depend on the 
part of the library not present here. The simplest example of library use can be 
found in <TT><A href="example01.html">example01.cpp</A></TT>. The exact example 
used in the paper can be found in <TT><A href="example03.html">example03.cpp</A></TT> 
and is only available under GCC at the moment because it depends on the compiler 
implementing C++11 initializer lists.
</P>
<P>
As a larger example you might want to look at the source code of a <A href="#codecmp">C++ pretty printer</A> we implemented for the <A href="http://parasol.tamu.edu/pivot/">Pivot framework</A>.
</P>
<!--
<H3><a name="matching">Pattern Matching</a></H3>

<P>The library source code can be downloaded here: <A href="pm.zip">pm.zip</A>. 
It addresses general pattern matching for C++ and is subject of a separate 
paper. We present the source code here only for completeness.
</P>
-->
<H2>Discussions</H2>
<P>There have been few discussions of the library on the internet with some 
relevant questions asked/answered:</P>

<UL>
<LI>2013-02-24: <A href="http://www.reddit.com/r/cpp/comments/18i8nf/open_and_efficient_type_switch_for/">reddit.com</A>
<LI>2013-02-13: <A href="http://isocpp.org/blog/2013/02/open-and-efficient-type-switch-for-c-solodkyy-dos-reis-and-stroustrup">isocpp.org</A>
</UL>

<H2>Synthetic Tests Performance Comparison</H2>
<P>
Synthetic tests were designed to compare the run-time performance of visitors 
and type switching in a setting where the actual pay-load functionality is the 
minimal (i.e. just returning a case number here), while the most of the time is 
spent on detecting the case. 
There are 24 test configurations, which are a combination of the following 4 
parameters described in details in the paper:
</P>

<UL>
<LI>3 use case scenarios: Repetitive, Random or Sequential
<LI>2 encodings: Open or Tag
<LI>2 syntax variations: Unified or Special
<LI>2 cases of presence or absence of forwarding to the base class in visitors.
</UL>

<P>
The result of the timing experiment can be seen in Excel Spreadsheet 
<a href="Timing.xlsx">Timing.xlsx</a>. Embedded version of the file can be seen 
below, however some features/information is not available when viewing in 
browser, so we suggest you download the file locally.
</P>
<P>
Please feel free to edit the four values that are allowed to be editable below 
to see the amount of cycles spent per iteration of a corresponding test by 
visitors and type switch. The first value can be in range 1-7 and controls the 
compiler configuration used for tests. The 3 other boolean parameters represent 
the 3 last parameters described above: Encoding, Syntax and Forwarding. 
The numbers hidden by the chart are copied from the corresponding sheet below by 
choosing compiler configuration. You can see it directly by clicking on the 
corresponding sheet.
</P>

<iframe width="100%" height="600" frameborder="0" scrolling="no" src="https://r.office.microsoft.com/r/rlidExcelEmbed?su=2783159172736703518&Fi=SD269FC48594575C1E!230&ak=t%3d0%26s%3d0%26v%3d!ADIMvZn3S8MyMdY&kip=1&wdAllowInteractivity=False&AllowTyping=True&ActiveCell='Visualization'!A8&wdHideGridlines=True&wdHideHeaders=True&wdDownloadButton=True"></iframe>

<P>
All the tests have been performed in the diagnostic boot of Windows 7, in which 
the operating system loads only the bare minimum of drivers and services. This 
limits the effect of other applications on the timed application as well as 
makes the timing numbers very sustainable: i.e. you can see that each test of a 
given configuration was performed 3 times, with numbers representing the amount 
of cycles per iteration almost always beeing the same or different by 1.
</P>

<H2>Real Code Performance Comparison</H2>
<P>
The C++ pretty printer that we implemented using both type switching (open 
encoding) and visitor design pattern was ran on a set of standard header files 
to print them out. The produced output of both program was byte-to-byte the 
same.
</P>
<P>
The running times for the entire execution of the printer (not including loading 
and termination of the application or parsing of the input file) are presented 
below for two consequitive experiments. The numbers correspond to 2467841 units 
per 1 second. X indicates the winner. The general observation is that on small 
files (e.g. those from C run-time library and few small C++ files) visitors win 
because the total number of nodes in AST and thus calls does not justify our 
set-up calls. We win on all the large standard headers, where the total number 
of calls can reach numbers indicated below in the memory usage chart. The 
rightmost column lists lines of codes in each input file after preprocessing and 
removing any empty lines, #line directives and #pragmas.
</P>
<P>
Visitor-based implementation of pretty printer is faster on files of 44--588 
lines of code, with average 136 lines per those inputs, where visitors win. On 
these input files it is faster by 1.17%--21.42% with an average speed-up of 
8.75%. Open type switch based implementation of pretty printer is faster on 
files of 144--9851 lines of code, with average 3497 lines per those input files, 
where open type switch wins. On these inputs it is faster by 0.18% -- 32.99% 
with an average speed-up of 5.53%.
</P>
<PRE>
            EXPERIMENT 1            #              EXPERIMENT 2            #
 Visitors    Input file TypeSwitch  #   Visitors    Input file TypeSwitch  #  Input file    LOC
==================================  #  ==================================  #  =================
 168609 | |  algorithm  |X| 165256  #   172249 | |  algorithm  |X| 165018  #  algorithm    5467
   2022 |X|  cassert    | |   2201  #     2031 |X|  cassert    | |   2211  #  cassert        47
   3989 |X|  cctype     | |   4227  #     4072 |X|  cctype     | |   4406  #  cctype         98
   2182 |X|  cerrno     | |   2495  #     2230 |X|  cerrno     | |   2498  #  cerrno         50
   2939 |X|  cfloat     | |   3279  #     3010 |X|  cfloat     | |   3217  #  cfloat         61
   1899 |X|  climits    | |   2205  #     2030 |X|  climits    | |   2456  #  climits        44
  14830 |X|  cmath      | |  15740  #    15274 |X|  cmath      | |  15452  #  cmath         338
   2838 |X|  csetjmp    | |   3162  #     3045 |X|  csetjmp    | |   3225  #  csetjmp        65
   2534 |X|  csignal    | |   2803  #     2626 |X|  csignal    | |   2793  #  csignal        53
    751 |X|  cstdarg    | |    910  #      747 |X|  cstdarg    | |    907  #  cstdarg        51
   2339 |X|  cstddef    | |   2682  #     2542 |X|  cstddef    | |   2614  #  cstddef        55
   8505 |X|  cstdio     | |   8828  #     8896 | |  cstdio     |*|   8623  #  cstdio        195
   7810 |X|  cstdlib    | |   7938  #     8399 | |  cstdlib    |*|   7794  #  cstdlib       195
   6412 | |  cstring    |X|   6208  #     6559 | |  cstring    |X|   6240  #  cstring       144
   4184 |X|  ctime      | |   4511  #     4498 |X|  ctime      | |   4584  #  ctime         104
  17970 | |  cwchar     |X|  17030  #    18322 | |  cwchar     |X|  17204  #  cwchar        430
   5313 |X|  cwctype    | |   5797  #     5428 |X|  cwctype    | |   5855  #  cwctype        98
 179706 | |  deque      |X| 175360  #   186641 | |  deque      |X| 176304  #  deque        5477
   4551 |X|  exception  | |   4859  #     4619 |X|  exception  | |   5044  #  exception     107
  15926 |X|  functional | |  16286  #    16069 |X|  functional | |  16385  #  functional    588
  44995 | |  iosfwd     |X|  36451  #    45638 | |  iosfwd     |X|  36346  #  iosfwd        981
  76075 | |  iterator   |X|  72553  #    77843 | |  iterator   |X|  73809  #  iterator     2356
  52246 | |  limits     |X|  51984  #    54408 | |  limits     |X|  51802  #  limits       1521
 190473 | |  list       |X| 186927  #   195184 | |  list       |X| 186792  #  list         5981
 102173 | |  memory     |X|  98085  #   103635 | |  memory     |X|  98894  #  memory       3171
   5789 |X|  new        | |   6077  #     8079 | |  new        |*|   6075  #  new           156
  79381 | |  numeric    |X|  79238  #    78918 | |  numeric    |X|  76705  #  numeric      2481
 314098 | |  queue      |X| 300355  #   310009 | |  queue      |X| 299821  #  queue        9851
 181218 | |  stack      |X| 176946  #   181784 | |  stack      |X| 180710  #  stack        5563
 132298 | |  stdexcept  |X| 131237  #   130425 | |  stdexcept  |X| 127400  #  stdexcept    3907
   5680 |X|  typeinfo   | |   6125  #     5718 |X|  typeinfo   | |   6191  #  typeinfo      145
  41860 | |  utility    |X|  39939  #    40407 | |  utility    |X|  40119  #  utility      1095
  82865 | |  valarray   |X|  80094  #    80487 | |  valarray   |X|  80185  #  valarray     1551
 197234 | |  vector     |X| 188669  #   194703 | |  vector     |X| 188664  #  vector       5985
==================================  #  ==================================  #  =================
</PRE>

<H2>Memory Usage</H2>
<P>
The picture below represents memory usage as well as cache hits and misses for 
the run of our pretty printer on 'queue' standard library header (it has the 
largest LOC after preprocessing in our test set).
</P>

<P align="center"><IMG src="Memory.png"></P>

<P>
There were 8 match statements in it marked A-H. The info [N/M] next to the 
letter indicates the actual number of different types (actually vtbl-pointers) 
that came through that match statement - N; and the amount of case clauses the 
match statement had - M (we use it as an estimate of types). N is also the 
number of cases the corresponding match statement had to be executed 
sequentially (instead of a direct jump).
</P>
<P>
Memory is measured in bytes. The blue part corresponds to our cache and the 
maroon - to the hash table that keeps mapping of elements to necessary info. 
Transparent parts of both colors indicate the allocated extra memory that is not 
holding any data.  The black box within the blue part also indicates the amount 
of entries in the cache that are allocated for only one vtbl pointer and thus 
never result in a cache miss. The non-transparent part without black box 
represents the percentage of vtbl pointers that have to share their cache entry 
with at least one other vtbl-pointer and thus may result in collisions during 
access.
</P>
<P>
The actual number of hits and misses for each of the match statements is 
indicated on top of the corresponding column. The sum of them is the total 
amount of calls made. Hits indicate situation when we found entry in cache and 
didn't have to make roundtrip to the hash-table to get it. Misses indicate the 
number of cases during actual run we had to pick the entry from the hash table 
and update the cache with it. The number of misses is always larger then or 
equal to N.
</P>
<P>
The last 3 match statements have memory pre-allocated for 8 types - that's the 
minimum size we preallocate, even when only one type will show up. Just an 
internal optimization. The rest of the entries allocate the amount of memory 
proportional to the larger of the two numbers M and N (actually the closest 
power of 2 after that maximum). The reason for that is also an internal 
optimization to avoid cache reconfigurations.
</P>
<P>
The table doesn't have to be hash table and can be implemented with any 
other container i.e. sorted vector, map etc. that let us find quickly by a given 
vtbl-ptr the data associated with it. If we implement it with sorted vector, the 
red part will shrink to only the non-transparent part.
</P>

<H2><a name="codecmp">Code Comparison</a></H2>

<P>The table below presents side-by-side comparison of the code implementing the 
C++ pretty-printer using visitors (left) and pattern-matching (right).</P>

<P>Both files were adopted from their original form in order to render as little 
differences as possible in order to let the reader see the exact correspondence 
in functionality. In particular the following changes were made:</P>

<UL>
<LI>The formal parameter of all the visit methods was renamed to 'matched' in 
order to correspond to the name of the variable with the same name we introduce 
in each case clause.
<LI>The variable-patterns of the Case-clauses were renamed to correspond to 
names of member-functions on the target object used to obtain their values. The 
visitor implementation calls those functions explicitly in corresponding places.
<LI>All the visitors were defined inside corresponding functions that invoke 
them with all the visit methods defined inline in order to match the structure 
of match statements as much as possible. This was not the case for visitors with 
large visit methods.
<LI>Any common code for both files was put into separate file to not obstruct 
comparison.
</UL> 

<P>The files were 798 lines (pattern matching) and 893 lines (visitors) before 
these changes were applied, making the pattern matching code about 11% shorter. 
Actually compared files are 719 lines long (pattern matching) and 732 lines long 
(visitors). Byte size of the pattern-matching source file is 34761 and visitors 
file 40464 bytes, making the pattern-matching source about 14% smaller.</P>

<P>
<table>
<tr><td><a href="printer_visitors.html">printer_visitors.cpp</a></td>
    <td>Visitors based implementation (left file in comparison)</td></tr>
<tr><td><a href="printer_matching.html">printer_matching.cpp</a></td>
    <td>Pattern-matching based implementation (right file in comparison)</td></tr>
<tr><td><a href="visitors.html">visitors.hpp</a></td>
    <td>Excerpt from Pivot's header with visitor definition</td></tr>
<tr><td><a href="match_ipr.html">match_ipr.hpp</a></td>
    <td>Definition of pattern-matching bindings</td></tr>
</table>
</P>

<iframe name="Comparison" src="Visitors-Matching.html" width="100%" height="14350" ></iframe>
<!--
<H2>Errata</H2>

<P>
The discussion of the following work did not appear in the sent draft due to version 
control skew. We are aware of this work based on the suggestion from a previous 
review.<BR>
<EM>Urs Holzle, Craig Chambers and David Ungar ``Optimizing Dynamically-Typed 
Object-Oriented Languages with Polymorphic Inline Caches'' (ECOOP'91)</EM>.
</P>
<P>
PICs generate code at run-time, our approach neither requires re-compilation, 
nor re-linking nor even any computations at dynamic linking time. Besides, PICs 
are based on decision trees and we show that they are inferior to jump tables in 
application to type switching. Our setting is quite different from theirs though 
- they are mapping arbitrary number of receiver types to arbitrary number of 
implementations of a specific dynamically looked up method, called at that call 
site. We are mapping an arbitrary number of receivers to a fixed amount of jump 
targets. We are thus able to generate code at compile time that can be executed 
two ways: sequentially and directly with a jump. The novelty of specifically 
memoization device is thus in a code layout that can memoize its own sequential 
execution.
</P>
<P>
A small observation about Inline Caches (IC) that PICs are based on: virtual 
function calls on modern architectures are essentially executed with IC, because 
of the hardware call target prediction. This can be easily seen from our 
repetitive benchmark where calls on different objects of the same reciever type 
are twice faster than virtual calls on objects of different reciever types. The 
repetitive benchmark was specifically designed to compare our performance for 
this specific case. Holzle also notes that inline caches are effective only if 
the receiver type (and thus the call target) remains relatively constant at a 
call site.
</P>
<P>
We decided not to discuss in details multiple and predicated dispatching in the 
context of this work as based on previous reviews it was significantly side 
tracking the discussion, giving our work a connotation that we are trying to 
solve the expression problem. Instead we discuss in greater details why type 
switch is not a solution to the expression problem and why it is better than 
visitor design pattern.
</P>
<P>
We also forgot to mention that the class hierarchy tests had both tests 
involving multiple translation units and single translation units in them. 
Some of the class hierarchies were so large neither GCC nor Visual C++ could 
compile them when put into a single translation unit due to the limitations on 
object file size. This was happening primarily due to the number of virtual 
tables that had to be generated and put into a single object file. We thus 
implemented a possibility of splitting the tests into multiple translation 
units in order to make the setting as close as possible to the real world use.
</P>
-->
</BODY>
</HTML>